Abstract

This paper describes a set of comparative experiments, including
cross-corpus evaluation, between five alternative algorithms for
supervised Word Sense Disambiguation (WSD), namely Naive Bayes,
Exemplar-based learning, SNoW, Decision Lists, and Boosting. Two main
conclusions can be drawn: 1) The LazyBoosting algorithm outperforms
the other four state-of-the-art algorithms in terms of accuracy and
ability to tune to new domains; 2) The domain dependence of WSD
systems seems very strong and suggests that some kind of adaptation
or tuning is required for cross-corpus application.

1 Introduction

Word Sense Disambiguation (WSD) is the problem of assigning the
appropriate meaning (or sense) to a given word in a text or
discourse. Resolving the ambiguity of words is a central problem for
large scale language understanding applications and their associated
tasks (Ide and Veronis, 1998). Besides, WSD is one of the most
important open problems in NLP. Despite the wide range of approaches
investigated (Kilgarriff and Rosenzweig, 2000) and the large effort
devoted to tackling this problem, to date, no large-scale,
broad-coverage and highly accurate WSD system has been built.

One of the most successful current lines of research is the
corpus-based approach, in which statistical or Machine Learning (ML)
algorithms have been applied to learn statistical models or
classifiers from corpora in order to perform WSD. Generally,
supervised approaches (those that learn from a previously
semantically annotated corpus) have obtained better results than
unsupervised methods on small sets of selected ambiguous words, or
artificial pseudo-words. Many standard ML algorithms for supervised
learning have been applied, such as: Decision Lists (Yarowsky, 1994;
Agirre and Martinez, 2000), Neural Networks (Towell and Voorhees,
1998), Bayesian learning (Bruce and Wiebe, 1999), Exemplar-Based
learning (Ng, 1997a), and Boosting (Escudero et al., 2000a), etc.
Further, in (Mooney, 1996) some of the previous methods are compared
jointly with Decision Trees and Rule Induction algorithms, on a very
restricted domain.

Although some comparative studies between alternative algorithms have
been reported (Mooney, 1996; Ng, 1997a; Escudero et al., 2000a;
Escudero et al., 2000b), none of them addresses the issue of the
portability of supervised ML algorithms for WSD, i.e., testing
whether the accuracy of a system trained on a certain corpus can be
extrapolated to other corpora or not. We think that the study of the
domain dependence of WSD, in the style of other studies devoted to
parsing (Sekine, 1997; Ratnaparkhi, 1999), is needed to assess the
validity of the supervised approach, and to determine to what extent
a pre-process of tuning is necessary to make real WSD systems
portable. In this direction, this work compares five different ML
algorithms and explores their portability and tuning ability by
training and testing them on different corpora.

2 Learning Algorithms Tested 

Naive-Bayes (NB). Naive Bayes is intended as a simple representative
of statistical learning methods. It has been used in its most
classical setting (Duda and Hart, 1973). That is, assuming
independence of features, it classifies a new example by assigning
the class that maximizes the conditional probability of the class
given the observed sequence of features of that example. Model
probabilities are estimated during the training process using
relative frequencies. To avoid the effect of zero counts when
estimating probabilities, a very simple smoothing technique has been
used, which was proposed in (Ng, 1997a).

Despite its simplicity, Naive Bayes is claimed to obtain
state-of-the-art accuracy on supervised WSD in many papers (Mooney,
1996; Ng, 1997a; Leacock et al., 1998).

Exemplar-based Classifier (EB). In exemplar, instance, or
memory-based learning (Aha et al., 1991) no generalization of
training examples is performed. Instead, the examples are stored in
memory and the classification of new examples is based on the classes
of the most similar stored examples. In our implementation, all
examples are kept in memory and the classification of a new example
is based on a k-NN (Nearest-Neighbours) algorithm using Hamming
distance to measure closeness. For values of k greater than 1, the
resulting sense is the weighted majority sense of the k nearest
neighbours, where each example votes for its sense with a strength
proportional to its closeness to the test example.

Exemplar-based learning is said to be the best option for WSD (Ng,
1997a). Other authors (Daelemans et al., 1999) point out that
exemplar-based methods tend to be superior in language learning
problems because they do not forget exceptions.

SNoW: A Winnow based Classifier. SNoW stands for Sparse Network Of
Winnows, and it is intended as a representative of on-line learning
algorithms. In the SNoW architecture there is a Winnow (Littlestone,
1988) node for each class, which learns to separate that class from
all the rest. In this paper, our approach to WSD using SNoW follows
that of (Escudero et al., 2000).

SNoW has been shown to perform very well in high dimensional domains,
where both the training examples and the target function reside very
sparsely in the feature space (Roth, 1998), e.g., text
categorization, context-sensitive spelling correction, WSD, etc.

Decision Lists (DL). In this setting, Decision Lists are ordered
lists of features extracted from the training examples and weighted
by a log-likelihood measure (Yarowsky, 1994). The approximation
described in (Agirre and Martinez, 2000) has been fully used (using
also their pruning and smoothing techniques).

Decision Lists were one of the most successful systems on the 1st
edition of the Senseval competition (Kilgarriff and Rosenzweig,
2000).
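
As an illustration of the Decision Lists setting described above, the following is a minimal sketch, not the implementation used in the experiments. It assumes that each rule (feature, sense) is weighted by the log of the ratio between the count of the feature with that sense and its count with all other senses. The add-alpha smoothing and the pruning of non-positive weights are simplifying assumptions standing in for the techniques of (Agirre and Martinez, 2000), and the function names are hypothetical.

```python
import math
from collections import defaultdict


def train_decision_list(examples, alpha=0.1):
    """Build a decision list from (feature_set, sense) training pairs.

    Returns rules (weight, feature, sense) sorted by decreasing weight.
    alpha is an assumed add-alpha smoothing constant.
    """
    counts = defaultdict(lambda: defaultdict(int))
    senses = set()
    for features, sense in examples:
        senses.add(sense)
        for feature in features:
            counts[feature][sense] += 1

    rules = []
    for feature, by_sense in counts.items():
        total = sum(by_sense.values())
        for sense in senses:
            with_sense = by_sense.get(sense, 0)
            # log-likelihood ratio: this sense versus all other senses
            weight = math.log((with_sense + alpha) / (total - with_sense + alpha))
            if weight > 0:  # assumed pruning: keep only positive evidence
                rules.append((weight, feature, sense))
    rules.sort(key=lambda r: (-r[0], r[1], r[2]))
    return rules


def classify(rules, features, default_sense):
    """Apply the first (highest weighted) rule whose feature is present."""
    for _weight, feature, sense in rules:
        if feature in features:
            return sense
    return default_sense


if __name__ == "__main__":
    train = [
        ({"w-1=interest", "pos=NN"}, "money"),
        ({"w-1=interest"}, "money"),
        ({"w+1=in"}, "attention"),
        ({"w+1=in", "pos=NN"}, "attention"),
    ]
    dl = train_decision_list(train)
    print(classify(dl, {"w+1=in"}, default_sense="money"))
```

Only the single strongest matching rule decides the sense, which is what distinguishes a decision list from methods that combine the evidence of all features.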

LazyBoosting (LB). The main idea of boost- 
ing algorithms is to combine many simple and 
moderately accurate hypotheses (called weak 
classifiers) into a single, highly accurate clas- 
sifier. The weak classifiers are trained sequen- 
tially and, conceptually, each of them is trained 
on the examples which were most difficult to 
classify by the preceding weak classifiers. 

LazyBoosting (Escudero et al., 2000a) is a
simple modification of the AdaBoost.MH algorithm
(Schapire and Singer, to appear), which
consists of reducing the feature space that is ex- 
plored when learning each weak classifier. More 
specifically, a small proportion of attributes are 
randomly selected and the best weak rule is se- 
lected only among them. This modification sig- 
nificantly increases the efficiency of the learning 
process with no loss in accuracy. 
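
The idea of exploring only a random portion of the attributes at each round can be sketched as follows. This is a simplified illustration under stated assumptions: it uses plain binary AdaBoost with labels in {-1, +1} and weak rules that test a single binary feature, rather than the multiclass AdaBoost.MH used in the paper. The function names and the `fraction` parameter are hypothetical.

```python
import math
import random


def lazy_boost(X, y, rounds=10, fraction=0.5, seed=0):
    """Binary AdaBoost in which each round examines a random attribute subset.

    X: list of binary feature vectors; y: labels in {-1, +1}.
    Returns the ensemble [(alpha, feature, polarity)] and the final
    weight distribution over the training examples.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    weights = [1.0 / n] * n
    k = max(1, int(fraction * d))  # number of attributes explored per round
    ensemble = []
    for _ in range(rounds):
        best = None
        for j in rng.sample(range(d), k):
            for polarity in (1, -1):
                # weak rule: predict `polarity` if feature j is on, else the opposite
                err = sum(w for w, x, label in zip(weights, X, y)
                          if (polarity if x[j] else -polarity) != label)
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        err, j, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, polarity))
        # increase the weight of misclassified examples, then renormalize
        weights = [w * math.exp(-alpha * label * (polarity if x[j] else -polarity))
                   for w, x, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble, weights


def predict(ensemble, x):
    """Sign of the weighted vote of all weak rules."""
    score = sum(alpha * (polarity if x[j] else -polarity)
                for alpha, j, polarity in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    X = [[1, 0], [1, 1], [0, 0], [0, 1]]
    y = [1, 1, -1, -1]
    model, final_weights = lazy_boost(X, y, rounds=5, fraction=1.0)
    print([predict(model, x) for x in X])
```

With `fraction=1.0` the sketch reduces to ordinary boosting over all attributes; smaller values trade a less exhaustive search per round for a proportional reduction in the cost of learning each weak rule.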

3 Setting 

The set of comparative experiments has been carried out on a subset
of 21 words of the DSO corpus, which is a semantically annotated
English corpus collected by Ng and colleagues (Ng and Lee, 1996), and
available from the Linguistic Data Consortium (LDC)¹. Each word is
treated as a different classification problem. They are 13 nouns
(age, art, body, car, child, cost, head, interest, line, point,
state, thing, work) and 8 verbs (become, fall, grow, lose, set,
speak, strike, tell). The average number of senses per word is close
to 10 and the number of training examples per word is close to 1,000.

The DSO corpus contains sentences from two different corpora, namely
Wall Street Journal (WSJ) and Brown Corpus (BC). Therefore, it is
easy to perform experiments about the portability of alternative
systems by training them on the WSJ part (A part, hereinafter) and
testing them on the BC part (B part, hereinafter), or vice versa.

¹ http://www.ldc.upenn.edu/

Two kinds of information are used to train classifiers: local and
topical context. The former consists of the words and part-of-speech
tags appearing in a window of ±3 items around the target word, and
collocations of up to three consecutive words in the same window. The
latter consists of the unordered set of content words appearing in
the whole sentence.
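
The two kinds of context can be illustrated with a short sketch. Several details here are assumptions not stated in the text: collocations are taken to be all runs of two or three consecutive words inside the window, content words are identified by Penn Treebank tag prefixes (NN, VB, JJ, RB), and the feature naming scheme and function names are hypothetical.

```python
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "RB")  # assumed definition of content words


def local_features(words, tags, target, window=3):
    """Words, POS tags and short collocations within +/- `window` of the target."""
    feats = set()
    for offset in range(-window, window + 1):
        pos = target + offset
        if offset == 0 or not 0 <= pos < len(words):
            continue
        feats.add(f"w{offset:+d}={words[pos].lower()}")
        feats.add(f"p{offset:+d}={tags[pos]}")
    lo = max(0, target - window)
    hi = min(len(words), target + window + 1)
    for n in (2, 3):  # collocations of up to three consecutive words
        for start in range(lo, hi - n + 1):
            gram = "_".join(w.lower() for w in words[start:start + n])
            feats.add(f"c{start - target:+d}:{n}={gram}")
    return feats


def topical_features(words, tags, target):
    """Unordered set of content words of the sentence, excluding the target."""
    return {f"t={w.lower()}"
            for i, (w, t) in enumerate(zip(words, tags))
            if i != target and t[:2] in CONTENT_TAG_PREFIXES}


if __name__ == "__main__":
    words = ["The", "bank", "raised", "its", "interest", "rates", "again"]
    tags = ["DT", "NN", "VBD", "PRP$", "NN", "NNS", "RB"]
    print(sorted(local_features(words, tags, target=4)))
    print(sorted(topical_features(words, tags, target=4)))
```

Each example is thus represented as a set of binary features, which is the representation assumed by the Hamming distance of the exemplar-based classifier and by the weak rules of boosting.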

4 Experiments 

4.1 Comparing the five approaches 

The five algorithms, jointly with a naive Most-Frequent-sense
Classifier (MFC), have been tested on 7 different combinations of
training-test sets². Accuracy figures, averaged over the 21 words,
are reported in Table 1. The comparison leads to the following
conclusions:

LazyBoosting outperforms the other four methods in all tests. The
difference is statistically significant in all cases except when
comparing LazyBoosting to the Exemplar-Based approach in the case
marked with an asterisk³.

Extremely poor results are observed when testing the portability of
the systems. Restricting to LazyBoosting results, we observe that the
accuracy obtained in A-B is 47.1% while the accuracy in B-B (which
can be considered an upper bound for LazyBoosting in the B corpus) is
59.0%, that is, a drop of 12 points. Furthermore, 47.1% is only
slightly better than the most frequent sense in corpus B, 45.5%.
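
The arithmetic behind the portability figures quoted above can be checked with a few lines. The three percentages are taken from the text; the variable names are only illustrative.

```python
# Accuracy figures (in %) quoted in the text for LazyBoosting and the MFC baseline
acc_train_a_test_b = 47.1   # trained on A (WSJ), tested on B (BC)
acc_train_b_test_b = 59.0   # trained and tested on B, taken as an upper bound
acc_mfc_on_b = 45.5         # most frequent sense baseline in corpus B

absolute_drop = round(acc_train_b_test_b - acc_train_a_test_b, 1)
relative_drop = round(100 * absolute_drop / acc_train_b_test_b, 1)
margin_over_mfc = round(acc_train_a_test_b - acc_mfc_on_b, 1)

print(f"absolute drop: {absolute_drop} points")
print(f"relative drop: {relative_drop}% of the in-domain accuracy")
print(f"margin over MFC: {margin_over_mfc} points")
```

The absolute drop of about 12 points corresponds to roughly one fifth of the in-domain accuracy, while the cross-corpus system stays within two points of the most-frequent-sense baseline.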

² The combinations of training-test sets are called: A+B-A+B, A+B-A,
A+B-B, A-A, B-B, A-B, and B-A, respectively. In this notation, the
training set is placed at the left hand side of symbol "-", while the
test set is at the right hand side. For instance, A-B means that the
training set is corpus A and the test set is corpus B. The symbol "+"
stands for set union.

³ Statistical tests of significance applied: McNemar's test and
10-fold cross-validation paired Student's t-test at a confidence
value of 95% (Dietterich, 1996).

Apart from accuracy figures, the observation of the predictions made
by the five methods on the test sets provides interesting information
about the comparison of the algorithms. Table 2 shows the agreement
rates and the Kappa (κ) statistics⁴ between all pairs of methods in
the A+B-A+B case. 'DSO' stands for the annotation of the DSO corpus,
which is taken as the correct one. Therefore, the agreement rate with
DSO contains the accuracy results previously reported. Some
interesting conclusions can be drawn from those tables:

1. NB obtains the most similar results with regard to MFC in
agreement rate and Kappa values in all tables. The agreement ratio is
76%, that is, more than 3 out of 4 times it predicts the most
frequent sense.

2. LB obtains the most similar results with regard to DSO (accuracy)
in agreement rate and Kappa values, and it has the least similar
Kappa and agreement values with regard to MFC. This indicates that LB
is the method that best learns the behaviour of the DSO examples.

3. The Kappa values are very low. But, as suggested in (Veronis,
1998), evaluation measures, such as precision and recall, should be
computed relative to the agreement between the human annotators of
the corpus and not to a theoretical 100%. It seems pointless to
expect more agreement between the system and the reference corpus
than between the annotators themselves. Contrary to the intuition
that the agreement between human annotators should be very high in
the WSD task, some papers report surprisingly low figures. For
instance, (Ng et al., 1999) reports an accuracy rate of 56.7% and a
Kappa value of 0.317 when comparing the annotation of a subset of the
DSO corpus performed by two independent research groups⁵. Similarly,
(Veronis, 1998) reports values of Kappa near to zero when annotating
some special words for the ROMANSEVAL corpus⁶. From this point of
view, the Kappa value of 0.44 achieved by LB in A+B-A+B could be
considered an excellent result. Unfortunately, the subset of the DSO
corpus used in that study and the one used in this report are not the
same and, therefore, a direct comparison is not possible.

⁴ The Kappa statistic κ (Cohen, 1960) is a better measure of
inter-annotator agreement which reduces the effect of chance
agreement. It has been used for measuring inter-annotator agreement
during the construction of some semantically annotated corpora
(Veronis, 1996; Ng et al., 1999).

⁵ A Kappa value of 1 indicates perfect agreement, while 0.8 is
considered as indicating good agreement (Carletta, 1996).

⁶ http://www.lpl.univ-aix.fr/projects/romanseval

                                     Accuracy (%)
                 A+B-A+B     A+B-A       A+B-B       A-A         B-B         A-B    B-A
MFC              46.55±0.71  53.90±2.01  39.21±1.90  55.94±1.10  45.52±1.27  36.40  38.71
Naive Bayes      61.55±1.04  67.25±1.07  55.85±1.81  65.86±1.11  56.80±1.12  41.38  47.66
Exemplar-based   63.01±0.93  69.08±1.66  56.97±1.22  68.98±1.06  57.36±1.68  45.32  51.13
Decision Lists   61.58±0.98  67.64±0.94  55.53±1.85  67.57±1.44  56.56±1.59  43.01  48.83
SNoW             60.92±1.09  65.57±1.33  56.28±1.10  67.12±1.16  56.13±1.23  44.07  49.76
LazyBoosting     66.32±1.34  71.79±1.51  60.85±1.81  71.26±1.15  58.96±1.86  47.10  51.99*

Table 1: Accuracy results (± standard deviation) of the methods on all training-test combinations





A+B-A+B   DSO     MFC     NB      EB      SN      DL      LB
DSO       -       46.6    61.6    63.0    60.9    61.6    66.3
MFC       -0.19   -       73.9    60.0    55.9    64.9    54.9
NB        0.24    -0.09   -       76.3    74.5    76.8    71.4
EB        0.36    -0.15   0.44    -       69.6    70.7    72.5
SN        0.36    -0.17   0.44    0.44    -       67.5    69.0
DL        0.32    -0.13   0.40    0.41    0.38    -       69.9
LB        0.44    -0.17   0.37    0.50    0.46    0.42    -

Table 2: Kappa (κ) statistic (below diagonal)
and agreement rate (above diagonal) between
all methods in A+B-A+B experiments
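
For reference, the two quantities reported in Table 2 can be computed from a pair of label sequences as in the sketch below. It follows the standard definition of Cohen's Kappa, κ = (P_o - P_e) / (1 - P_e), where P_o is the observed agreement and P_e the agreement expected by chance from the marginal label frequencies. The function name is hypothetical, and how the per-word values were aggregated in the paper is not assumed here.

```python
from collections import Counter


def agreement_and_kappa(labels_a, labels_b):
    """Return (agreement rate, Cohen's kappa) for two equal-length label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # chance agreement: probability that both annotators pick the same label
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined in this degenerate case
    return observed, (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    system = ["s1", "s1", "s2", "s2"]
    reference = ["s1", "s2", "s2", "s2"]
    print(agreement_and_kappa(system, reference))
```

Because the chance term grows when both sequences concentrate on the same frequent sense, a pair of methods can show a high agreement rate and still obtain a low Kappa value.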

4.2 About the tuning to new domains 

This experiment explores the effect of a simple tuning process
consisting of adding to the original training set a relatively small
sample of manually sense-tagged examples of the new domain. The size
of this supervised portion varies from 10% to 50% of the available
corpus in steps of 10% (the remaining 50% is kept for testing).
Results indicate that LazyBoosting is again superior to its
competitors.
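
The data partitioning of this tuning experiment can be sketched as follows. This is an illustration under assumptions: a single random split is used, the tuning portions are nested (each larger portion contains the smaller ones), and the function and variable names are hypothetical. The text does not specify whether the actual experiments used repeated splits or cross-validation.

```python
import random


def tuning_splits(source_examples, target_examples, seed=0):
    """Yield (percentage, training set, tuning portion, test set) tuples.

    Half of the new-domain corpus is held out for testing; 10% to 50% of
    the new-domain corpus (taken from the other half) is added to the
    original training set.
    """
    rng = random.Random(seed)
    shuffled = list(target_examples)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    pool, test = shuffled[:half], shuffled[half:]
    for pct in (10, 20, 30, 40, 50):
        size = min(len(shuffled) * pct // 100, len(pool))
        tuning = pool[:size]
        yield pct, list(source_examples) + tuning, tuning, test


if __name__ == "__main__":
    corpus_a = [("A", i) for i in range(100)]   # stand-in for the original domain
    corpus_b = [("B", i) for i in range(100)]   # stand-in for the new domain
    for pct, train, tuning, test in tuning_splits(corpus_a, corpus_b):
        print(pct, len(train), len(tuning), len(test))
```

Training on `tuning` alone instead of `train` gives the comparison point needed to decide whether keeping the original training examples is worthwhile.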

Summarizing, the results obtained show that 
for Naive Bayes, Exemplar Based, SNoW and 
Decision Lists methods it is not worth keeping 
the original training examples. Instead, a bet- 
ter (but disappointing) strategy would be sim- 
ply using the tuning corpus. However, this is 
not the situation of LazyBoosting, for which a 
moderate (but consistent) improvement of ac- 
curacy is observed when retaining the original 
training set. 

We observed that part of the poor results obtained is explained by:
1) Corpora A and B have very different distributions of senses and,
therefore, different a priori biases; 2) Examples of corpora A and B
contain different information and, therefore, the learning algorithms
acquire different (and non-interchangeable) classification cues from
the two corpora. The study of the rules acquired by LazyBoosting from
WSJ and BC helped in understanding the differences between corpora.
On the one hand, the types of features used in the rules were
significantly different between corpora and, additionally, there were
very few rules that applied to both sets. On the other hand, the sign
of the prediction of many of these common rules was somewhat
contradictory between corpora (Escudero et al., 2000c).



4.3 About the training data quality 

The observation of the rules acquired by LazyBoosting could also help
improve data quality. It is known that mislabelled examples resulting
from annotation errors tend to be hard examples to classify correctly
and, therefore, tend to have large weights in the final distribution.
This observation makes it possible both to identify the noisy
examples and to use LazyBoosting as a way to improve the training
corpus.
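
The effect described above can be reproduced on a toy problem. The sketch below is only an illustration: it uses plain binary AdaBoost over single-feature weak rules as a stand-in for LazyBoosting, the data are synthetic, and the function names are hypothetical. One example is deliberately mislabelled, and it ends up with the largest weight in the final distribution.

```python
import math


def final_boosting_weights(X, y, rounds=5):
    """Run binary AdaBoost (labels in {-1, +1}) and return the final example weights."""
    n, d = len(X), len(X[0])
    weights = [1.0 / n] * n
    for _ in range(rounds):
        best = None
        for j in range(d):
            for polarity in (1, -1):
                preds = [polarity if x[j] else -polarity for x in X]
                err = sum(w for w, p, label in zip(weights, preds, y) if p != label)
                if best is None or err < best[0]:
                    best = (err, preds)
        err, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        # misclassified examples gain weight, correctly classified ones lose it
        weights = [w * math.exp(-alpha * label * p)
                   for w, p, label in zip(weights, preds, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


def heaviest_examples(weights, k):
    """Indices of the k examples with the largest final weight."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]


if __name__ == "__main__":
    X = [[1]] * 5 + [[0]] * 5
    y = [1, 1, -1, 1, 1, -1, -1, -1, -1, -1]  # example 2 is deliberately mislabelled
    w = final_boosting_weights(X, y)
    print(heaviest_examples(w, k=1))
```

Ranking the training examples by final weight therefore gives a short list of candidates for manual revision.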

A preliminary experiment has been carried out in this direction by
studying the rules acquired by LazyBoosting from the training
examples of the word state. The manual revision of the 50 highest
scored rules allowed us to identify a high number of noisy training
examples (there were 11 tagging errors out of 50) and, additionally,
17 examples out of 50 that were not coherently tagged, probably due
to the too fine-grained or unclear distinctions between the senses
involved in these examples. Thus, 28 of the 50 examples had some
problem, that is, more than one in every two cases.

Figure 1: Results of the tuning experiment



5 Conclusions and Future Work 

This work reports a comparative study of five ML approaches to WSD,
and focuses on studying their portability. The main conclusions are:

The LazyBoosting algorithm outperforms the other four
state-of-the-art supervised ML methods in all domains tested.
Furthermore, this algorithm shows better properties when it is tuned
to new domains.

Portability is a very important issue that has received little
attention up to the present. In this paper we show that a process of
tuning to the domain of application is required to assure the
portability of WSD systems (at least if the training and testing
corpora differ as much as BC and WSJ do). This evidence questions the
idea of "robust broad-coverage WSD" introduced by (Ng, 1997b), in
which a supervised system trained on a large enough corpus (say a
thousand examples per word) should provide fairly accurate
disambiguation on any corpus. To determine the viability of the
supervised approach to WSD, we believe that a serious effort should
be devoted to studying the problem of obtaining representative enough
training corpora at a reasonable cost.



Further work is planned to be done in the 
following directions: 

1. Since most of the knowledge learned from a domain is not useful
when changing to a new domain, further investigation is needed on
tuning strategies, especially on those using non-supervised
algorithms.

2. It has been noted that mislabelled examples resulting from
annotation errors tend to be hard examples to classify correctly and,
therefore, tend to have large weights in the final distribution. This
observation could provide a methodology to automatically verify the
semantic annotation of corpora and the grouping of senses.

3. Moreover, the inspection of the rules learned by LazyBoosting
could provide evidence about similar behaviours of a priori different
senses. This type of knowledge could be useful to perform clustering
of too fine-grained or artificial senses.